Abstract

We consider the problem of learning regression functions from pairwise data 
when there exists prior knowledge that the relation to be learned is symmetric or 
anti-symmetric. Such prior knowledge is commonly enforced by symmetrizing or 
anti-symmetrizing pairwise kernel functions. Through spectral analysis, we show 
that these transformations reduce the kernel’s effective dimension. Further, we 
provide an analysis of the approximation properties of the resulting kernels, and 
bound the regularization bias of the kernels in terms of the corresponding bias of 
the original kernel. 


1 Introduction 


Many real-world phenomena can be described in terms of pairwise relationships between
entities. When learning pairwise relations, symmetry and anti-symmetry are two
types of prior knowledge constraints that commonly appear when both of the objects
in a pair belong to the same domain. A typical example of an application where
relationships are often assumed to be symmetric is the prediction of protein-protein
interactions; if protein A interacts with protein B, then conversely B interacts
with A. A typical example of an anti-symmetric relation is a preference relation;
if A is preferred over B, then conversely B is not preferred over A. Commonly used
symmetric pairwise kernels include the symmetrized Kronecker [Ben-Hur and Noble,
2005] and Cartesian [Kashima et al., 2009] kernels, as well as the metric learning
[Vert et al., 2007] kernels. Such kernels are analyzed in more detail by
[Brunner et al., 2012]. Typical examples of anti-symmetric kernels are the transitive
kernel of [Herbrich et al., 2000] used for learning to rank, and the anti-symmetric
Kronecker product kernel [Pahikkala et al., 2010] for learning intransitive
preference relations.
Kernel-based learning algorithms are some of the most successful learning methods
in practice and they also enjoy strong theoretical properties. It is well known in
the machine learning literature that the eigenvalues and eigenfunctions of the integral
operator of the kernel play a central role in obtaining error estimates in learning
theory. One of the most intensively studied quantities depending on the eigenvalues is
the so-called effective dimension of the kernel, which has since its introduction by
[Zhang, 2002] been used by several other authors [Mendelson, 2003, Caponnetto and
De Vito, 2007]. For a recent summary of these results, see Hsu et al. [2014] and
references therein. Therefore, the determination of the operator’s eigensystem is
important in its own right. Another important tool for analysis is the theory of
universal kernels pioneered by Steinwart [2002], which indicates that if a kernel has
the so-called universality property, the corresponding hypothesis space can approximate
any continuous function arbitrarily well.

Intuitively it seems plausible that enforcing prior knowledge about symmetry or
anti-symmetry should result in better generalization, and many promising experimental
results have been obtained in the literature (see previous references). However,
a rigorous theoretical analysis of the effects that enforcing these properties
on the kernel function has on learning has thus far been missing from the literature.
As a step in this direction, [Waegeman et al., 2012] have shown that when symmetrizing
or anti-symmetrizing pairwise kernels that are formed by taking the Kronecker product
of two universal kernels, the resulting kernel allows approximating arbitrarily well
any symmetric or anti-symmetric continuous function. While these results show that
symmetrization or anti-symmetrization does not sacrifice the expressive power needed
for learning, the results concern only Kronecker product kernels, and do not provide
any guarantees that learning would be more efficient with the transformed kernels.

The main contributions and results of our paper are the following:


• The effective dimension of both the symmetrized and anti-symmetrized versions
of a pairwise kernel is smaller than that of the original pairwise kernel (see
Theorem 4.3).

• The approximation properties of the symmetric and anti-symmetric kernels are
analysed (see Theorem 4.6).

• We bound the regularization bias of the symmetric and anti-symmetric kernels
in terms of the regularization bias of the original kernel (see Theorem 4.9).


2 Preliminaries 

Definition 2.1 (Kernel function). For any set $\mathcal{X}$, the function $K$ is a kernel if it can be
written as the following type of an inner product:

\[ K(x, \bar{x}) = \langle \Phi(x), \Phi(\bar{x}) \rangle , \]

where

\[ \Phi : \mathcal{X} \to \mathcal{H}_\Phi \]

is a mapping from $\mathcal{X}$ to a Hilbert space $\mathcal{H}_\Phi$, popularly called the feature space in the
literature. Conversely, any kernel can be written as the above type of an inner product.
However, neither the feature mapping nor the feature space is unique.
To simplify the forthcoming considerations, we make a couple of extra assumptions
about the input space and kernels. Namely, we assume that the input space $\mathcal{X}$ is
compact (e.g. closed and bounded) and the kernel functions considered in this article
are continuous. Let $\mu$ be a probability distribution over $\mathcal{X}$ generating the data. We also
assume that $\mu$ has a probability density with respect to the Lebesgue measure (i.e. we can
write $h(x)\,d\mu(x) = h(x)\mu(x)\,dx$ for any function $h$).

We make use of the Hilbert space $L^2(\mathcal{X}, \mu)$ of square integrable functions on
$(\mathcal{X}, \mu)$ with the inner product $\langle h, g \rangle = \int_{\mathcal{X}} h(x) g(x)\,d\mu(x)$. The elements of the
space $L^2(\mathcal{X}, \mu)$ are equivalence classes of functions rather than individual functions but
this technical detail has no effect on the considerations below.


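Definition 2.2 ([Aronszajn, 1950]). For each real-valued kernel $K$ and an input space
$\mathcal{X}$, there exists a unique Hilbert space $\mathcal{H}(K)$, known as the reproducing kernel Hilbert
space (RKHS), such that:


1. $K_x \in \mathcal{H}(K)$ $\forall x \in \mathcal{X}$, where $K_x : \mathcal{X} \to \mathbb{R}$
are functions such that $K_x(\bar{x}) = K(x, \bar{x})$.

2. $\mathrm{span}(\{K_x\}_{x \in \mathcal{X}})$ is dense in $\mathcal{H}(K)$.

3. The inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}(K)}$ associated with $\mathcal{H}(K)$ satisfies

\[ f(x) = \langle f, K_x \rangle_{\mathcal{H}(K)} \quad \forall f \in \mathcal{H}(K),\ x \in \mathcal{X} , \]

which is known as the reproducing property. In particular,

\[ K(x, \bar{x}) = \langle K_x, K_{\bar{x}} \rangle_{\mathcal{H}(K)} \quad \forall x, \bar{x} \in \mathcal{X} . \]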
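As an illustration only, the reproducing property can be checked numerically for functions in the span of finitely many $K_x$. The sketch below assumes a Gaussian kernel on the real line; the helper names `gaussian_kernel`, `evaluate` and `rkhs_inner` are hypothetical and not part of the paper.

```python
import math

def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian kernel on real numbers (an illustrative choice of K)."""
    return math.exp(-gamma * (x - z) ** 2)

def evaluate(alphas, points, x, kernel=gaussian_kernel):
    """Evaluate f = sum_i alpha_i K_{x_i} at the point x."""
    return sum(a * kernel(p, x) for a, p in zip(alphas, points))

def rkhs_inner(alphas, points, betas, centers, kernel=gaussian_kernel):
    """RKHS inner product of two finite kernel expansions:
    <sum_i a_i K_{x_i}, sum_j b_j K_{z_j}> = sum_ij a_i b_j K(x_i, z_j)."""
    return sum(a * b * kernel(p, c)
               for a, p in zip(alphas, points)
               for b, c in zip(betas, centers))

alphas, points = [0.5, -1.0, 2.0], [0.0, 0.3, 1.2]
x = 0.7
# Reproducing property: <f, K_x> should equal f(x).
inner_with_kx = rkhs_inner(alphas, points, [1.0], [x])
value_at_x = evaluate(alphas, points, x)
```

Here `inner_with_kx` and `value_at_x` agree because the inner product with $K_x$ collapses to the kernel expansion evaluated at $x$.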


In the literature, the mapping

\[ \Phi_K : x \mapsto K_x \in \mathcal{H}(K) \]

is often referred to as the canonical feature map of the kernel.

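Definition 2.3 (Integral operator of a kernel). The probability distribution $\mu$ over $\mathcal{X}$
yields a linear operator

\[ U_K : L^2(\mathcal{X}, \mu) \to \mathcal{H}(K) \]

defined as

\[ U_K h = \int_{\mathcal{X}} K_x h(x)\,d\mu(x) . \]

The adjoint of this operator is the inclusion $U_K^* : \mathcal{H}(K) \to L^2(\mathcal{X}, \mu)$, that is,

\[ \langle U_K h, g \rangle_{\mathcal{H}(K)} = \langle h, U_K^* g \rangle_{L^2(\mathcal{X}, \mu)} . \tag{1} \]

Note the RKHS inner product on the left hand side, determined by the reproducing property,
being changed to the $L^2(\mathcal{X}, \mu)$ inner product on the right. The composition of $U_K$ with its
adjoint is the operator

\[ T_K = U_K^* U_K : L^2(\mathcal{X}, \mu) \to L^2(\mathcal{X}, \mu) , \qquad (T_K h)(\bar{x}) = \int_{\mathcal{X}} K(x, \bar{x}) h(x)\,d\mu(x) \]

for all $h \in L^2(\mathcal{X}, \mu)$. This decomposition is illustrated in the following commutative
diagram: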
diagram: 
\[ L^2(\mathcal{X}, \mu) \xrightarrow{\ U_K\ } \mathcal{H}(K) \xrightarrow{\ U_K^*\ } L^2(\mathcal{X}, \mu) , \qquad T_K = U_K^* U_K . \]


The operator $T_K$ can be shown to be continuous, self-adjoint and Hilbert-Schmidt,
the last property indicating that its eigenvalues are square-summable, which is
characterized below in more detail. We next recollect some classical results from functional
analysis required in the forthcoming considerations.

Theorem 2.4 (Spectral theorem for compact operators). Suppose $\mathcal{L}$ is a Hilbert space
and $T : \mathcal{L} \to \mathcal{L}$ is a compact and self-adjoint linear operator. Then, $\mathcal{L}$ has an
orthonormal basis $\{\phi_i\}_i$ consisting of eigenvectors of $T$.

To compress the forthcoming notation and to take advantage of the machinery of
operator algebra, we use the following expression for the eigendecomposition of the
integral operators:

\[ T = V \Lambda V^* , \]

where $V : e_i \mapsto \phi_i$ and $\Lambda : e_i \mapsto \lambda_i e_i$, with $e_i$ being the standard basis vectors of $\ell^2$.

For the integral operators of continuous kernels on compact domains, we have the
following result known as Mercer’s theorem:

Theorem 2.5 (Mercer 1909). Suppose $K$ is a continuous symmetric non-negative definite
kernel. Then there is an orthonormal basis $\{\phi_i\}_i$ of $L^2(\mathcal{X}, \mu)$ consisting of
eigenfunctions of $T_K$ such that the corresponding sequence of eigenvalues $\{\lambda_i\}_i$ is nonnegative.
The eigenfunctions corresponding to non-zero eigenvalues are continuous on $\mathcal{X}$ and $K$
has the representation

\[ K(x, \bar{x}) = \sum_{i \in \mathbb{N}} \lambda_i \phi_i(x) \phi_i(\bar{x}) , \]

where the convergence is absolute and uniform.


The spectral theorem also yields the following corollary about commuting compact
and self-adjoint operators sharing the same eigensystem (see e.g. [Zimmer, 1990]):


Corollary 2.6. Let $\mathcal{L}$ be a Hilbert space and let $T_1 : \mathcal{L} \to \mathcal{L}$ and $T_2 : \mathcal{L} \to \mathcal{L}$
be compact and self-adjoint operators, such that $T_1 T_2 = T_2 T_1$. Then there is an
orthonormal basis $\{f_j\}$ of $\mathcal{L}$ such that each $f_j$ is an eigenvector for both $T_1$ and $T_2$.


Next, we define the concept of majorization for sequences of infinite lengths (see
e.g. Li and Busch [2013] and references therein).


Definition 2.7 (Majorization). Let $r = (r_i)_{i=1}^{\infty} \in c_0^+$ and $s = (s_i)_{i=1}^{\infty} \in c_0^+$, where $c_0^+$ is
the positive cone of sequences decreasing monotonically to 0. We say that $s$ majorizes
$r$, denoted as $r \prec s$, if

\[ \sum_{i=1}^{m} r_i \le \sum_{i=1}^{m} s_i \quad \forall m \in \mathbb{N} \qquad \text{and} \qquad \sum_{i=1}^{\infty} r_i = \sum_{i=1}^{\infty} s_i . \]

In particular, for two trace class operators $T_1$ and $T_2$ on a Hilbert space, we say that
$T_2 \prec T_1$ if the sequence of eigenvalues of $T_1$ majorizes the sequence of eigenvalues
of $T_2$.
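For finite sequences, the majorization condition can be checked directly from the partial sums. The following sketch is only an illustration for finite, non-negative sequences (shorter sequences are padded with zeros); the function name `majorizes` and the tolerance `tol` are assumptions made for this example.

```python
def majorizes(s, r, tol=1e-12):
    """Return True if s majorizes r for finite non-negative sequences."""
    s_sorted = sorted(s, reverse=True)
    r_sorted = sorted(r, reverse=True)
    n = max(len(s_sorted), len(r_sorted))
    s_sorted += [0.0] * (n - len(s_sorted))
    r_sorted += [0.0] * (n - len(r_sorted))
    # The total sums must agree.
    if abs(sum(s_sorted) - sum(r_sorted)) > tol:
        return False
    # Every partial sum of r must be bounded by the partial sum of s.
    partial_s = partial_r = 0.0
    for s_i, r_i in zip(s_sorted, r_sorted):
        partial_s += s_i
        partial_r += r_i
        if partial_r > partial_s + tol:
            return False
    return True

s = [0.7, 0.2, 0.1]     # a concentrated sequence
r = [0.4, 0.35, 0.25]   # a flatter sequence with the same sum
```

With these values, `majorizes(s, r)` holds while `majorizes(r, s)` does not: flatter sequences are majorized by more concentrated ones.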
The next result is a recent generalization by [Li and Busch, 2013] of the classical
Uhlmann’s theorem for infinite dimensional Hilbert spaces. Before that, we also define
the doubly-stochastic operations, following Li and Busch [2013]:


Definition 2.8 (Doubly-stochastic operation). Let $\mathcal{T}(\mathcal{L})$ denote the (Banach) space of
all trace class operators on a Hilbert space $\mathcal{L}$. We say that an operation
$\Gamma : \mathcal{T}(\mathcal{L}) \to \mathcal{T}(\mathcal{L})$ is doubly-stochastic if it preserves trace
(i.e. $\mathrm{trace}(T) = \mathrm{trace}(\Gamma(T))$), is
unital, indicating that $I = \Gamma(I)$ for the identity operator $I$ on the Hilbert space, and
there exists a sequence $\{E_i\}_{i=1}^{\infty}$ of compact operators on the Hilbert space $\mathcal{L}$, known
in the literature as the Kraus operators, such that the operation can be written as

\[ \Gamma(T) = \sum_{i=1}^{\infty} E_i T E_i^* . \tag{2} \]


Theorem 2.9 (Uhlmann’s theorem for infinite dimensional Hilbert spaces). If $T_1$ and
$T_2$ are trace-class operators on a Hilbert space, then $T_2 \prec T_1$ iff there exists a
doubly-stochastic operation $\Gamma$ such that $T_2 = \Gamma(T_1)$.


3 Pairwise Kernels 

Let us next define the family of pairwise kernels. Assume that the input space can be
written as

\[ \mathcal{X} = \mathcal{V}^2 , \]

where $\mathcal{V}$ is a compact metric space. The kernels over $\mathcal{V}^2$ can accordingly be written as
the following types of inner products

\[ K(v, v', \bar{v}, \bar{v}') = \langle \Phi(v, v'), \Phi(\bar{v}, \bar{v}') \rangle , \]

where $v, v', \bar{v}, \bar{v}' \in \mathcal{V}$ and $\Phi$ is a joint feature mapping over a pair of inputs, that is,
$\Phi(v, v')$ is a feature space representation for an ordered pair $(v, v')$.

Next, we define certain specific types of pairwise kernels, starting from the
permuted kernel:

Definition 3.1 (Permuted pairwise kernel). Let $K(v, v', \bar{v}, \bar{v}')$ be an arbitrary kernel
on $\mathcal{V}^2$. Then, its permuted pairwise kernel is

\[ K^P(v, v', \bar{v}, \bar{v}') = K(v', v, \bar{v}', \bar{v}) . \]


An immediate step forward is to define the following type of kernels that are
invariant to the permutations in the above defined sense:

Definition 3.2 (Permutation invariant pairwise kernels). We say that a kernel
$K^{PI}(v, v', \bar{v}, \bar{v}')$ on $\mathcal{V}^2$ is permutation invariant if it is equal to its permuted kernel,
that is,

\[ K^{PI}(v, v', \bar{v}, \bar{v}') = K^{PI}(v', v, \bar{v}', \bar{v}) . \]

A natural way to construct a permutation invariant kernel from a given pairwise kernel
$K$ is to consider the projection from the set of all kernels to the set of permutation
invariant kernels:

\[ K^{PI}(v, v', \bar{v}, \bar{v}') = \frac{1}{2} \left( K(v, v', \bar{v}, \bar{v}') + K(v', v, \bar{v}', \bar{v}) \right) . \]
Our next step is to define the well-known symmetric pairwise kernels as well as
their anti-symmetric counterparts:

Definition 3.3 (Symmetric and anti-symmetric pairwise kernels). We say that a kernel
$K^S(v, v', \bar{v}, \bar{v}')$ on $\mathcal{V}^2$ is a symmetric pairwise kernel if

\[ K^S(v, v', \bar{v}, \bar{v}') = K^S(v', v, \bar{v}, \bar{v}') . \]

Analogously, we say that a kernel $K^A(v, v', \bar{v}, \bar{v}')$ on $\mathcal{V}^2$ is an antisymmetric pairwise
kernel if

\[ K^A(v, v', \bar{v}, \bar{v}') = -K^A(v', v, \bar{v}, \bar{v}') . \]

Similarly to the permutation invariance, one can construct symmetric and antisymmetric
kernels from an arbitrary kernel $K(v, v', \bar{v}, \bar{v}')$ with the following projections:

\[ K^S(v, v', \bar{v}, \bar{v}') = \frac{1}{4} \left( K(v, v', \bar{v}, \bar{v}') + K(v', v, \bar{v}, \bar{v}') + K(v, v', \bar{v}', \bar{v}) + K(v', v, \bar{v}', \bar{v}) \right) \]

and

\[ K^A(v, v', \bar{v}, \bar{v}') = \frac{1}{4} \left( K(v, v', \bar{v}, \bar{v}') - K(v', v, \bar{v}, \bar{v}') - K(v, v', \bar{v}', \bar{v}) + K(v', v, \bar{v}', \bar{v}) \right) , \]

respectively.

The following connection between the symmetric, anti-symmetric and permutation
invariant kernels is immediate:

Lemma 3.4. Both the symmetric and antisymmetric pairwise kernels are permutation
invariant. Moreover, if $K^S(v, v', \bar{v}, \bar{v}')$ and $K^A(v, v', \bar{v}, \bar{v}')$ are the symmetric and
antisymmetric forms of a kernel $K(v, v', \bar{v}, \bar{v}')$ obtained with the projections given
in Definition 3.3, then the permutation invariant form of the kernel obtained with the
projection given in Definition 3.2 can be expressed as the sum of the symmetric and
antisymmetric forms:

\[ K^{PI}(v, v', \bar{v}, \bar{v}') = K^S(v, v', \bar{v}, \bar{v}') + K^A(v, v', \bar{v}, \bar{v}') . \]
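The projections of Definitions 3.2 and 3.3 and the identity of Lemma 3.4 can be sketched as higher-order functions acting on a pairwise kernel. The example below is a minimal illustration with objects represented as real numbers; `base_kernel` is a hypothetical Kronecker-type kernel chosen only because it is neither symmetric nor anti-symmetric in the pairwise sense.

```python
import math

def base_kernel(v, vp, w, wp):
    """Illustrative pairwise kernel: a Gaussian kernel on the first objects
    times a linear-plus-one kernel on the second objects."""
    return math.exp(-(v - w) ** 2) * (1.0 + vp * wp)

def permutation_invariant(K):
    """Projection of Definition 3.2."""
    return lambda v, vp, w, wp: 0.5 * (K(v, vp, w, wp) + K(vp, v, wp, w))

def symmetrize(K):
    """Symmetric projection of Definition 3.3."""
    return lambda v, vp, w, wp: 0.25 * (
        K(v, vp, w, wp) + K(vp, v, w, wp) + K(v, vp, wp, w) + K(vp, v, wp, w))

def antisymmetrize(K):
    """Anti-symmetric projection of Definition 3.3."""
    return lambda v, vp, w, wp: 0.25 * (
        K(v, vp, w, wp) - K(vp, v, w, wp) - K(v, vp, wp, w) + K(vp, v, wp, w))

K_pi = permutation_invariant(base_kernel)
K_s = symmetrize(base_kernel)
K_a = antisymmetrize(base_kernel)
v, vp, w, wp = 0.2, -1.0, 0.5, 0.8   # arbitrary test arguments
```

Swapping the objects of the first pair leaves `K_s` unchanged and flips the sign of `K_a`, and `K_pi` equals `K_s + K_a` at every argument, as stated in Lemma 3.4.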


3.1 Spectral Analysis of Pairwise Kernels 

We next study the relationship between the integral operators of the permutation in¬ 
variant, symmetric and anti-symmetric kernels to the corresponding integral operator 
of the original kernel they were constructed from. 

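Theorem 3.5. Let $K(v, v', \bar{v}, \bar{v}')$ be an arbitrary pairwise kernel and let $K^{PI}$, $K^S$
and $K^A$ be its permutation invariant, symmetric and antisymmetric forms. Moreover,
let $T_K$, $T_{K^{PI}}$, $T_{K^S}$ and $T_{K^A}$ be the integral operators of the kernels $K$, $K^{PI}$, $K^S$
and $K^A$, respectively. Then,

\[
\begin{aligned}
T_{K^A} &= A^{\mu *} T_K A^{\mu} , \\
T_{K^S} &= S^{\mu *} T_K S^{\mu} , \\
T_{K^{PI}} &= \frac{1}{2} \left( T_K + P^{\mu *} T_K P^{\mu} \right) &\quad (3) \\
&= S^{\mu *} T_K S^{\mu} + A^{\mu *} T_K A^{\mu} , &\quad (4)
\end{aligned}
\]

where

\[ P^{\mu} : h(v, v') \mapsto \frac{\mu(v', v)}{\mu(v, v')} h(v', v) \]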

is an operator to which we refer as the permutation operator with respect to the
measure $\mu$, and whose adjoint is


Hv,v ) !-)■ _, h{v ,v) , 

p.{v\v) 


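and

\[ S^{\mu} = \frac{1}{2} \left( I + P^{\mu} \right) , \qquad A^{\mu} = \frac{1}{2} \left( I - P^{\mu} \right) \]

are projection operators to which we refer as the symmetrizer and anti-symmetrizer
with respect to the measure $\mu$, and $I$ is the identity operator of $L^2(\mathcal{V}^2, \mu)$.

See Section 5.1 for a proof.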

Next, we look at what can be said about the spectrum of the integral operators
considered in the above theorem. This consideration can be divided into the important
special case of the measure $\mu$ being symmetric, that is,

\[ \mu(v, v') = \mu(v', v) \quad \forall (v, v') \in \mathcal{V}^2 , \]

and the general case. The measure is symmetric, for example, in various types of
ranking and preference learning tasks, as is considered in more detail below. In addition,
many other pairwise learning problems with a non-symmetric measure can be turned into
problems with a symmetric measure by the technique known as virtual examples. That
is, whenever a datum $(v, v')$ is drawn from $\mu$, one also introduces a virtual example
$(v', v)$ with the same output if the problem is considered to be symmetric or with the
opposite output in the anti-symmetric case. With symmetric $\mu$, the symmetrizer and
anti-symmetrizer projections do not depend on the measure and we denote them simply
as $S$ and $A$.
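The virtual examples technique described above can be sketched in a few lines. This is a minimal illustration, assuming that the training data is a list of `((v, v'), y)` tuples with real-valued outputs; the function name `add_virtual_examples` and the toy data are hypothetical.

```python
def add_virtual_examples(data, antisymmetric=False):
    """Augment ((v, v'), y) samples with their swapped counterparts.

    The swapped pair keeps the output for symmetric problems and receives
    the negated output for anti-symmetric problems.
    """
    sign = -1.0 if antisymmetric else 1.0
    augmented = []
    for (v, vp), y in data:
        augmented.append(((v, vp), y))
        augmented.append(((vp, v), sign * y))
    return augmented

# Toy preference data: a positive output means the first object is preferred.
preferences = [(("a", "b"), 1.0), (("c", "a"), -0.5)]
augmented = add_virtual_examples(preferences, antisymmetric=True)
```

After the augmentation, every ordered pair appears together with its swapped version, so the empirical distribution of the inputs is symmetric.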

Corollary 3.6. If $\lambda_i^K$, $\lambda_i^S$ and $\lambda_i^A$ denote the eigenvalues of $T_K$, $T_{K^S}$ and $T_{K^A}$,
respectively, then

\[ \lambda_i^S \le \lambda_i^K \quad \text{and} \quad \lambda_i^A \le \lambda_i^K \quad \text{for } i = 1, 2, \ldots \tag{5} \]

If $\mu$ is symmetric, the set of operators $\{S, A, T_{K^S}, T_{K^A}, T_{K^{PI}}\}$ commutes, which in
turn indicates that they can be diagonalized simultaneously as follows:

\[
\begin{aligned}
T_{K^{PI}} &= V \Lambda^{PI} V^* , \\
T_{K^S} &= V \Lambda^{S} V^* , \\
T_{K^A} &= V \Lambda^{A} V^* , \\
S &= V I^{S} V^* , \\
A &= V I^{A} V^* ,
\end{aligned}
\]

where $V$ is a unitary operator containing the eigenfunctions and $\Lambda^{PI}$, $\Lambda^{S}$, $\Lambda^{A}$, $I^{S}$
and $I^{A}$ are operators containing the corresponding eigenvalues of the five operators
under consideration, and

\[ \Lambda^{PI} = \Lambda^{S} + \Lambda^{A} , \tag{6} \]

if the eigenvalues are arranged in the order determined by the order of the eigenfunctions
in $V$.

Finally, if $\mu$ is symmetric, then

\[ T_{K^{PI}} \prec T_K \tag{7} \]

(i.e. the sequence of eigenvalues of $T_K$ majorizes the sequence of eigenvalues of
$T_{K^{PI}}$).

Proof. Since $A^{\mu}$ is a projection, $A^{\mu}(L^2(\mathcal{V}^2, \mu)) \subset L^2(\mathcal{V}^2, \mu)$, this constrains
the action of the integral operator $T_K$ onto the range of $A^{\mu}$, which is a subspace of
$L^2(\mathcal{V}^2, \mu)$. The eigenfunctions $\phi_i$ associated with nonzero eigenvalues $\lambda_i$ of $T_{K^A}$
belong to this subspace, and satisfy [Aronszajn, 1948]:

\[ T_K \phi_i - \lambda_i \phi_i = p \quad \text{with} \quad p \perp A^{\mu}(L^2(\mathcal{V}^2, \mu)) . \]

Since $A^{\mu}(L^2(\mathcal{V}^2, \mu)) \subset L^2(\mathcal{V}^2, \mu)$, we can use a well known theorem (see e.g.
Aronszajn [1948] and references therein) to obtain

\[ \lambda_i^A \le \lambda_i^K \quad \text{for } i = 1, 2, \ldots \]

and the case with $T_{K^S}$ goes analogously.

We observe that, with symmetric $\mu$, the operators $S$ and $A$ are self-adjoint, and
hence orthogonal projections. Furthermore, they are orthogonal with each other, that is,

\[ SA = AS = 0 , \tag{8} \]

and hence the set $\{S, A, T_{K^S}, T_{K^A}, T_{K^{PI}}\}$ of operators commutes, and therefore,
according to Corollary 2.6, they share the same eigenfunctions.

Finally, (7) follows from Uhlmann’s theorem, since we can define an operation

\[ \Gamma : \mathcal{T}(L^2(\mathcal{V}^2, \mu)) \to \mathcal{T}(L^2(\mathcal{V}^2, \mu)) , \qquad T_K \mapsto \frac{1}{2} \left( T_K + P T_K P \right) , \]

for which $T_{K^{PI}} = \Gamma(T_K)$ and which is doubly stochastic, because it is trace
preserving (as shown above), unital due to $PP = I$, and the set of Kraus operators
fulfilling (2) is $\{ \frac{1}{\sqrt{2}} I, \frac{1}{\sqrt{2}} P \}$. □
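A two-dimensional toy analogue makes the majorization (7) concrete. In the sketch below, a symmetric positive definite $2 \times 2$ matrix stands in for $T_K$ and the coordinate swap stands in for $P$; this finite-dimensional setting and the helper `eigenvalues_2x2` are assumptions made for illustration only.

```python
import math

def eigenvalues_2x2(a, b, c):
    """Eigenvalues (in descending order) of the symmetric matrix [[a, b], [b, c]]."""
    mean = 0.5 * (a + c)
    radius = math.sqrt((0.5 * (a - c)) ** 2 + b ** 2)
    return [mean + radius, mean - radius]

# T = [[a, b], [b, c]] plays the role of T_K.
a, b, c = 3.0, 0.5, 1.0
eig_T = eigenvalues_2x2(a, b, c)

# With P the coordinate swap, P T P = [[c, b], [b, a]], and therefore
# Gamma(T) = (T + P T P) / 2 = [[(a + c) / 2, b], [b, (a + c) / 2]].
eig_gamma = eigenvalues_2x2(0.5 * (a + c), b, 0.5 * (a + c))
```

The two spectra have the same sum (the trace is preserved), while the largest eigenvalue of the averaged matrix does not exceed that of the original one, which is exactly the finite-dimensional majorization condition for sequences of length two.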

It is interesting to note the following observation about the common eigensystem
of the operators $\{S, A, T_{K^S}, T_{K^A}, T_{K^{PI}}\}$:

Remark 3.7. All the eigenfunctions of $T_{K^{PI}}$ are either symmetric or antisymmetric.
Applying $S$ clears the eigenvalues of the antisymmetric eigenfunctions to zero, and applying
$A$ clears those of the symmetric ones. Since
$S$ and $A$ are orthogonal projections, their eigenvalues are either zeros or ones, and the
ones in $S$ correspond to the symmetric functions and zeros to the antisymmetric ones,
and vice versa for $A$.
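On a finite set of objects with the uniform (hence symmetric) measure, the operators $P$, $S$ and $A$ become small matrices, and the properties used above can be verified directly. The sketch below is such a toy discretization; the three-object set and the helper `matmul` are assumptions made for illustration.

```python
from itertools import product

objects = [0, 1, 2]
pairs = list(product(objects, repeat=2))          # ordered pairs (v, v')
index = {pair: i for i, pair in enumerate(pairs)}
n = len(pairs)

def matmul(a, b):
    """Product of two n-by-n matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
# Permutation operator under a symmetric measure: (P h)(v, v') = h(v', v).
P = [[1.0 if index[(vp, v)] == j else 0.0 for j in range(n)]
     for (v, vp) in pairs]
S = [[0.5 * (identity[i][j] + P[i][j]) for j in range(n)] for i in range(n)]
A = [[0.5 * (identity[i][j] - P[i][j]) for j in range(n)] for i in range(n)]
zero = [[0.0] * n for _ in range(n)]
```

In this discretization `S` and `A` are idempotent, their product vanishes, and they sum to the identity, so each function on pairs splits uniquely into a symmetric and an antisymmetric part.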
4 Error Bounds


Let

\[ \mathcal{R}(f) = \int_{\mathcal{X} \times \mathcal{Y}} L(f(x), y)\,d\rho(x, y) , \tag{9} \]

where $L$ is a loss function, denote the expected risk of $f$. For the squared loss, the
minimizer of (9) is the so-called regression function

\[ f^*(x) = \int_{\mathcal{Y}} y\,d\rho(y \mid x) . \]

The hypothesis spaces under our consideration in this paper do not necessarily include
the regression function, and hence another quantity of interest is the error associated to
the given RKHS $\mathcal{H}$:

\[ \inf_{f \in \mathcal{H}} \mathcal{R}(f) . \]

If we have prior knowledge, for example, that the underlying regression function is
anti-symmetric, then we can immediately assume that the errors associated to a kernel
$K$ and its anti-symmetric counterpart $K^A$ are equal. That is, we do not lose any
expressiveness by restricting our hypothesis space to anti-symmetric functions. The next
question is whether we can gain anything with the restriction.

Our next quantity of interest is the minimizer $f_{T,\lambda}$ of the regularized empirical risk
on a training set $T$ and a regularization parameter $\lambda$. In particular, we aim to analyze
the effect of using either the permutation-invariant, symmetric, or anti-symmetric forms
instead of the original kernel on the discrepancy

\[ \mathcal{R}(f_{T,\lambda}) - \inf_{f \in \mathcal{H}(K)} \mathcal{R}(f) , \]

known in the literature as the excess error.

Following [Hsu et al., 2014], we split the consideration of the excess error into three
parts:

\[ \mathcal{R}(f_{T,\lambda}) - \inf_{f \in \mathcal{H}(K)} \mathcal{R}(f) \le \epsilon_{rg} + \epsilon_{bs} + \epsilon_{vr} + 2 \left( \sqrt{\epsilon_{rg} \epsilon_{bs}} + \sqrt{\epsilon_{rg} \epsilon_{vr}} + \sqrt{\epsilon_{bs} \epsilon_{vr}} \right) , \]

where $\epsilon_{rg}$, $\epsilon_{bs}$, and $\epsilon_{vr}$ are, respectively, the bias caused by regularization, the bias
caused by the random drawing of the training inputs, and the variance caused by noise
in the outputs. We briefly consider each of these in turn in the following subsections.


4.1 Effective Dimension 


As discussed by Hsu et al. [2014] and also earlier by many other authors (see e.g.
[Zhang, 2005], [Caponnetto and De Vito, 2007]), the variance term $\epsilon_{vr}$ can be roughly
characterized with a concept known as the effective dimension:


Definition 4.1 (Effective dimension). The effective dimension $D(K, \mu, \lambda)$ of the kernel
$K$ with respect to the measure $\mu$ and the regularization parameter value $\lambda > 0$ is
defined as:

\[ D(K, \mu, \lambda) = \sum_{i} \frac{\lambda_i}{\lambda_i + \lambda} , \]

where $\lambda_i$ are the eigenvalues of the integral operator of the kernel $K$.
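Given a finite list of eigenvalues, the effective dimension is a one-line computation. The following sketch assumes that the spectrum is already known (for instance, from a truncated eigendecomposition); the function name and the example spectrum are hypothetical.

```python
def effective_dimension(eigenvalues, lam):
    """Effective dimension sum_i lambda_i / (lambda_i + lam) for lam > 0."""
    if lam <= 0:
        raise ValueError("the regularization parameter must be positive")
    return sum(ev / (ev + lam) for ev in eigenvalues)

spectrum = [1.0, 0.5, 0.25, 0.0]   # hypothetical eigenvalues of T_K
weak = effective_dimension(spectrum, 0.1)     # light regularization
strong = effective_dimension(spectrum, 1.0)   # heavy regularization
```

Each eigenvalue contributes a value between 0 and 1, so the effective dimension never exceeds the number of nonzero eigenvalues, and it shrinks as the regularization parameter grows.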
The next result shows that the eigenvalue majorization of the integral operators of
kernels is connected to the effective dimension of the kernels:


Proposition 4.2. Let $K_1$ and $K_2$ be kernels, and $T_1$ and $T_2$ their integral operators
with measure $\mu$, with $\mathrm{trace}(T_1) = \mathrm{trace}(T_2)$. Then,

\[ T_2 \prec T_1 \ \Rightarrow \ D(K_2, \mu, \lambda) \ge D(K_1, \mu, \lambda) \quad \forall \lambda > 0 . \]

Proof. We recollect the following result recently proven by [Mari et al., 2014] that
extends a well-known result to sequences of infinite lengths. Let $r = (r_i)_{i=1}^{\infty} \in c_0^+$ and
$s = (s_i)_{i=1}^{\infty} \in c_0^+$ with $\sum_i r_i = \sum_i s_i = 1$. Then,

\[ r \prec s \ \Rightarrow \ \sum_i \varphi(r_i) \ge \sum_i \varphi(s_i) \]

for all real non-negative strictly concave functions $\varphi$ defined on the segment $[0, 1]$. The
result follows immediately (with scaling the eigenvalues), since $\varphi(r) = r/(r + \lambda)$ is
real-valued, non-negative and strictly concave for $r, \lambda > 0$. □
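The two ingredients of the proof, the concavity of $r \mapsto r/(r + \lambda)$ and the resulting ordering of the effective dimensions, can be illustrated numerically. The sequences below are hypothetical spectra with unit trace, chosen so that `r` is majorized by `s`; this is a sanity check on a single example, not a proof.

```python
def phi(r, lam):
    """phi(r) = r / (r + lam), the summand of the effective dimension."""
    return r / (r + lam)

def effective_dimension(eigenvalues, lam):
    return sum(phi(ev, lam) for ev in eigenvalues)

s = [0.7, 0.2, 0.1]      # hypothetical eigenvalues of T_1
r = [0.4, 0.35, 0.25]    # hypothetical eigenvalues of T_2, majorized by s
lambdas = (0.01, 0.1, 1.0, 10.0)

# Midpoint concavity of phi on [0, 1] for each tested lambda.
concave = all(phi(0.5 * (a + b), lam) >= 0.5 * (phi(a, lam) + phi(b, lam))
              for lam in lambdas for a in (0.0, 0.3, 1.0) for b in (0.1, 0.6))

# Effective dimension of the majorized spectrum minus that of the majorizing one.
gaps = [effective_dimension(r, lam) - effective_dimension(s, lam)
        for lam in lambdas]
```

For every tested value of the regularization parameter the gap is non-negative, in agreement with the proposition: the flatter spectrum has the larger effective dimension.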


Given the above analysis of the eigensystems of the considered pairwise kernels,
we arrive at the following results about their effective dimensions:

Theorem 4.3. If $K$ is a pairwise kernel, then

\[ D(K^S, \mu, \lambda) \le D(K, \mu, \lambda) \tag{10} \]

and

\[ D(K^A, \mu, \lambda) \le D(K, \mu, \lambda) . \tag{11} \]

If the measure $\mu$ is symmetric, we also have

\[ D(K, \mu, \lambda) \le D(K^{PI}, \mu, \lambda) . \tag{12} \]

Proof. The inequalities (10) and (11) follow straightforwardly from (5), and the
inequality (12) follows from Proposition 4.2 and (7). □


4.2 Approximation Analysis 


We next turn our attention to the bias caused by the random drawing of the training
inputs. According to Hsu et al. [2014], this bias is affected, in addition to the above
considered effective dimension and the regularization bias considered below, by the
approximation error caused by the hypothesis space being too limited. In contrast, the
approximation error is zero if the hypothesis space contains the regression function or
functions that can approximate it arbitrarily closely. To guarantee that the hypothesis
space is expressive enough to approximate any function, we may use kernels that are
universal. On the other hand, if we have prior knowledge about the properties of the
regression function, for example, if we know it to be symmetric or anti-symmetric, we
may restrict the hypothesis space accordingly.


Related to the bias by random design, we also point out a recent result by
[Brunner et al., 2012], which shows an equivalence between the use of a symmetric pairwise
kernel and the original kernel with a symmetrized training set. We omit its detailed
consideration here due to lack of space.

To formalize these concepts, we first recollect the definition of universal kernels. 
Definition 4.4 ([Steinwart, 2002]). A continuous kernel $K$ on a compact metric space
$\mathcal{X}$ (i.e. $\mathcal{X}$ is closed and bounded) is called universal if the RKHS induced by $K$ is dense
in $C(\mathcal{X})$, where $C(\mathcal{X})$ is the space of all continuous functions $f : \mathcal{X} \to \mathbb{R}$.


Accordingly, the hypothesis space induced by the kernel $K$ can approximate any
function in $C(\mathcal{X})$ arbitrarily well, and hence it is said to have the universal
approximating property.

While the universal approximating property guarantees that the RKHS can, in the¬ 
ory, learn any concept, we do not necessarily have a need for it if we have prior knowl¬ 
edge about certain properties of the concept to be learned. Thus, we also define an 
analogous concept for non-universal kernels: 

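Definition 4.5. Let $K$ be a continuous kernel on a compact metric space $\mathcal{X}$ and let
$\mathcal{F} \subset C(\mathcal{X})$. If $\mathcal{F} \subset \overline{\mathcal{H}(K)}$, the definition of RKHS indicates that, for every function
$f \in \mathcal{F}$ and every $\epsilon > 0$, there exists a set of input points $\{x_i\}_{i=1}^{m} \subset \mathcal{X}$ and real
numbers $\{\alpha_i\}_{i=1}^{m}$ with $m \in \mathbb{N}$, such that

\[ \max_{x \in \mathcal{X}} \left| f(x) - \sum_{i=1}^{m} \alpha_i K(x_i, x) \right| \le \epsilon . \]

Accordingly, the hypothesis space induced by the kernel $K$ can approximate any
function in $\mathcal{F}$ arbitrarily well, and hence we say that the RKHS $\mathcal{H}(K)$ can approximate
$\mathcal{F}$.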


Armed with the above definitions, we present the next result characterizing the 
approximation properties of the symmetric and anti-symmetric kernels: 

Theorem 4.6. Let $\mathcal{F} \subset C(\mathcal{V}^2)$ be an arbitrary set of continuous functions, and let

\[ \mathcal{S} = \{ t \mid r \in \mathcal{F},\ t(v, v') = r(v, v') + r(v', v) \} \]

\[ \mathcal{A} = \{ t \mid r \in \mathcal{F},\ t(v, v') = r(v, v') - r(v', v) \} \]

be the sets of symmetric and antisymmetric functions determined by $\mathcal{F}$. Moreover,
let $K(v, v', \bar{v}, \bar{v}')$ be a kernel on $\mathcal{V}^2$ and let $K^S(v, v', \bar{v}, \bar{v}')$ and $K^A(v, v', \bar{v}, \bar{v}')$ be
the corresponding symmetric and antisymmetric kernels. If $\mathcal{F} \subset \overline{\mathcal{H}(K)}$, then
$\mathcal{S} \subset \overline{\mathcal{H}(K^S)}$ and $\mathcal{A} \subset \overline{\mathcal{H}(K^A)}$.


See Section 5.2 for a proof. This theorem is a generalization of the result of
[Waegeman et al., 2012], who proved that this result holds for the special cases of the
symmetric and anti-symmetric Kronecker product kernel.

As an example of an anti-symmetric kernel popularly used in the machine learning
literature, we may consider the following one originally analyzed by [Herbrich et al.,
2000]. Given a base kernel $K^{\phi}(v, \bar{v})$ over the objects, the pairwise learning to rank
approach corresponds to using the following transitive pairwise kernel:

\[ K^T(v, v', \bar{v}, \bar{v}') = \frac{1}{4} \left( K^{\phi}(v, \bar{v}) - K^{\phi}(v', \bar{v}) - K^{\phi}(v, \bar{v}') + K^{\phi}(v', \bar{v}') \right) . \]

In the theoretical framework considered in this paper, this kernel can be interpreted
as the anti-symmetrization of the pointwise kernel $K(v, v', \bar{v}, \bar{v}') = K^{\phi}(v, \bar{v})$, which
simply ignores the second object of each pair. The approximation properties of this kernel
are thus formalized in the following corollary:
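Corollary 4.7. Let

\[ \mathcal{R} = \{ t \mid t \in C(\mathcal{V}^2),\ \exists r \in C(\mathcal{V}),\ t(v, v') = r(v) - r(v') \} \]

be the set of all continuous ranking functions from $\mathcal{V}^2$ to $\mathbb{R}$. If $K^{\phi}(v, \bar{v})$ on $\mathcal{V}$ is
universal, then the RKHS of the transitive kernel [Herbrich et al., 2000] defined as

\[ K^T(v, v', \bar{v}, \bar{v}') = \frac{1}{4} \left( K^{\phi}(v, \bar{v}) - K^{\phi}(v', \bar{v}) - K^{\phi}(v, \bar{v}') + K^{\phi}(v', \bar{v}') \right) \tag{13} \]

can approximate $\mathcal{R}$.


Proof. We select

\[ \mathcal{F} = \{ f \mid f \in C(\mathcal{V}^2),\ \exists r \in C(\mathcal{V}),\ f(v, v') = r(v) \} \]

and apply Theorem 4.6. □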
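The interpretation of the transitive kernel as an anti-symmetrized pointwise kernel can be checked numerically. The sketch below assumes real-valued objects and a Gaussian base kernel; the function names are hypothetical and serve only this illustration.

```python
import math

def base_kernel(v, w):
    """Illustrative base kernel on objects (Gaussian)."""
    return math.exp(-(v - w) ** 2)

def pointwise(v, vp, w, wp):
    """Pairwise kernel that looks only at the first object of each pair."""
    return base_kernel(v, w)

def antisymmetrize(K):
    """Anti-symmetric projection of a pairwise kernel K."""
    return lambda v, vp, w, wp: 0.25 * (
        K(v, vp, w, wp) - K(vp, v, w, wp) - K(v, vp, wp, w) + K(vp, v, wp, w))

def transitive_kernel(v, vp, w, wp):
    """Transitive kernel written directly in terms of the base kernel."""
    return 0.25 * (base_kernel(v, w) - base_kernel(vp, w)
                   - base_kernel(v, wp) + base_kernel(vp, wp))

K_a = antisymmetrize(pointwise)
args = (0.1, 0.9, -0.4, 0.3)   # arbitrary test arguments (v, v', w, w')
```

Evaluating `K_a(*args)` and `transitive_kernel(*args)` gives the same number, since swapping the objects of a pair simply changes which object the pointwise kernel sees.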


4.3 Regularization Bias 


The following expression of the bias caused by regularization is known in the literature
(see e.g. Hsu et al. [2014]) but we show it here for completeness, because we
express it in a somewhat different form.


Lemma 4.8. Let $f$ be the regression function and $T_K$ the integral operator of a kernel
$K$. Further, let $h_\lambda \in \mathcal{H}(K)$ be the minimizer of the regularized mean squared error

\[ \int_{\mathcal{X}} (f - U_K^* h)^2\,d\mu + \lambda \| h \|_{\mathcal{H}(K)}^2 , \tag{14} \]

and let $f_\lambda^K = U_K^* h_\lambda$. Then, $f_\lambda^K$ can be expressed as

\[ f_\lambda^K = V \Lambda (\Lambda + \lambda I)^{-1} V^* f , \]

and the bias caused by regularization as

\[ \epsilon_{rg}(f, T_K, \lambda) = \lambda^2 \left\langle f, (T_K + \lambda I)^{-2} f \right\rangle , \]

where $T_K = V \Lambda V^*$ is the eigendecomposition of $T_K$, and the operator-vector
products are in $L^2(\mathcal{X}, \mu)$.

See Section 5.3 for a proof.
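In the eigenbasis of $T_K$, the expression $V \Lambda (\Lambda + \lambda I)^{-1} V^* f$ shrinks the $i$-th coefficient of $f$ by the factor $\lambda_i / (\lambda_i + \lambda)$, so the squared error of the shrunk function can be compared with the closed form $\lambda^2 \langle f, (T_K + \lambda I)^{-2} f \rangle$. The sketch below does this for a hypothetical three-dimensional spectrum; all numbers and function names are assumptions made for illustration.

```python
def shrink(coefficients, eigenvalues, lam):
    """Coefficients of V Lambda (Lambda + lam I)^{-1} V^* f in the eigenbasis."""
    return [ev / (ev + lam) * c for c, ev in zip(coefficients, eigenvalues)]

def regularization_bias(coefficients, eigenvalues, lam):
    """lam^2 <f, (T + lam I)^{-2} f> written in the eigenbasis of T."""
    return lam ** 2 * sum(c ** 2 / (ev + lam) ** 2
                          for c, ev in zip(coefficients, eigenvalues))

eigenvalues = [1.0, 0.5, 0.1]      # hypothetical spectrum of T_K
coefficients = [0.3, -1.2, 0.8]    # hypothetical coefficients <f, phi_i>
lam = 0.2

shrunk = shrink(coefficients, eigenvalues, lam)
# Squared L2 distance between f and its regularized counterpart.
direct = sum((c - s) ** 2 for c, s in zip(coefficients, shrunk))
closed_form = regularization_bias(coefficients, eigenvalues, lam)
```

The values `direct` and `closed_form` coincide up to rounding, and the bias is dominated by the components of $f$ that lie along eigenfunctions with small eigenvalues.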

Interestingly, if the same value of the regularization parameter is used for both 
the original kernel and its permutation invariant, symmetric or anti-symmetric forms, 
the type depending on the prior knowledge we have about the regression function, the 
regularization bias may get worse even if we use the correct type of modification of the 
kernel. In fact, one can find examples of symmetric regression functions for which the 
kernel symmetrization decreases the bias and other symmetric regression functions for 
which the bias is increased. However, the increase or decrease of the bias is rather mild 
and it is characterized by the following result: 
Theorem 4.9. Let us assume $\max_{(v, v') \in \mathcal{V}^2} K(v, v', v, v') = 1$. This assumption
can be made without losing generality due to the kernels being bounded.

If the measure $\mu$ is symmetric and the regression function is symmetric (antisymmetric),
the bias caused by regularization is the same for the kernels $K^{PI}$ and $K^S$ ($K^{PI}$ and
$K^A$) with all values of $\lambda$. Moreover, the bias caused by regularization with the amount
$\lambda$ for the kernels $K$ and $K^{PI}$ has the following relationship:

\[ \frac{4 \lambda^2 (1 + \lambda)^2}{(2 \lambda^2 + 2 \lambda + 1)^2}\, \epsilon_{rg}(f, T_K, \lambda) \le \epsilon_{rg}(f, T_{K^{PI}}, \lambda) \le \frac{(1 + 2 \lambda)^2}{4 \lambda^2 + 4 \lambda}\, \epsilon_{rg}(f, T_K, \lambda) . \]

See Section 5.4 for a proof.


5 Proofs 

5.1 Proof of Theorem 3.5

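Proof. We begin by considering the integral operator of the anti-symmetric kernel. For
$h, g \in L^2(\mathcal{V}^2, \mu)$,

\[
\begin{aligned}
\langle T_{K^A} h, g \rangle_{L^2(\mathcal{V}^2, \mu)}
&= \int_{\mathcal{V}^2} g(\bar{v}, \bar{v}') \left( \int_{\mathcal{V}^2} K^A(v, v', \bar{v}, \bar{v}') h(v, v')\,d\mu \right) d\mu \\
&= \int_{\mathcal{V}^2} \int_{\mathcal{V}^2} g(\bar{v}, \bar{v}') K^A(v, v', \bar{v}, \bar{v}') h(v, v')\,d\mu\,d\mu \\
&= \frac{1}{4} \int_{\mathcal{V}^2} \int_{\mathcal{V}^2} g(\bar{v}, \bar{v}') K(v, v', \bar{v}, \bar{v}') h(v, v')\,d\mu\,d\mu \\
&\quad - \frac{1}{4} \int_{\mathcal{V}^2} \int_{\mathcal{V}^2} g(\bar{v}, \bar{v}') K(v', v, \bar{v}, \bar{v}') h(v, v')\,d\mu\,d\mu \\
&\quad - \frac{1}{4} \int_{\mathcal{V}^2} \int_{\mathcal{V}^2} g(\bar{v}, \bar{v}') K(v, v', \bar{v}', \bar{v}) h(v, v')\,d\mu\,d\mu \\
&\quad + \frac{1}{4} \int_{\mathcal{V}^2} \int_{\mathcal{V}^2} g(\bar{v}, \bar{v}') K(v', v, \bar{v}', \bar{v}) h(v, v')\,d\mu\,d\mu \\
&= \frac{1}{4} \left\langle \int_{\mathcal{V}^2} K_{(v, v')} h(v, v')\,d\mu, \int_{\mathcal{V}^2} K_{(\bar{v}, \bar{v}')} g(\bar{v}, \bar{v}')\,d\mu \right\rangle_{\mathcal{H}(K)} \\
&\quad - \frac{1}{4} \left\langle \int_{\mathcal{V}^2} K_{(v', v)} h(v, v')\,d\mu, \int_{\mathcal{V}^2} K_{(\bar{v}, \bar{v}')} g(\bar{v}, \bar{v}')\,d\mu \right\rangle_{\mathcal{H}(K)} \\
&\quad - \frac{1}{4} \left\langle \int_{\mathcal{V}^2} K_{(v, v')} h(v, v')\,d\mu, \int_{\mathcal{V}^2} K_{(\bar{v}', \bar{v})} g(\bar{v}, \bar{v}')\,d\mu \right\rangle_{\mathcal{H}(K)} \\
&\quad + \frac{1}{4} \left\langle \int_{\mathcal{V}^2} K_{(v', v)} h(v, v')\,d\mu, \int_{\mathcal{V}^2} K_{(\bar{v}', \bar{v})} g(\bar{v}, \bar{v}')\,d\mu \right\rangle_{\mathcal{H}(K)} \\
&= \left\langle \int_{\mathcal{V}^2} K^A_{(v, v')} h(v, v')\,d\mu, \int_{\mathcal{V}^2} K^A_{(\bar{v}, \bar{v}')} g(\bar{v}, \bar{v}')\,d\mu \right\rangle_{\mathcal{H}(K)} ,
\end{aligned}
\]

where $K^A_{(v, v')} = \frac{1}{2} \left( K_{(v, v')} - K_{(v', v)} \right)$. Then,

\[
\begin{aligned}
\int_{\mathcal{V}^2} K^A_{(v, v')} h(v, v')\,d\mu(v, v')
&= \frac{1}{2} \int_{\mathcal{V}^2} K_{(v, v')} h(v, v')\,d\mu(v, v') - \frac{1}{2} \int_{\mathcal{V}^2} K_{(v', v)} h(v, v')\,d\mu(v, v') \\
&= \frac{1}{2} \int_{\mathcal{V}^2} K_{(v, v')} h(v, v') \mu(v, v')\,d(v, v') - \frac{1}{2} \int_{\mathcal{V}^2} K_{(v, v')} h(v', v) \mu(v', v)\,d(v, v') \\
&= \frac{1}{2} \int_{\mathcal{V}^2} K_{(v, v')} \left( h(v, v') - \frac{\mu(v', v)}{\mu(v, v')} h(v', v) \right) \mu(v, v')\,d(v, v') \\
&= \int_{\mathcal{V}^2} K_{(v, v')} (A^{\mu} h)(v, v')\,d\mu(v, v') \\
&= U_K A^{\mu} h .
\end{aligned}
\]

Accordingly, we observe that:

\[ \langle T_{K^A} h, g \rangle_{L^2(\mathcal{V}^2, \mu)} = \langle U_K A^{\mu} h, U_K A^{\mu} g \rangle_{\mathcal{H}(K)} = \langle h, A^{\mu *} T_K A^{\mu} g \rangle_{L^2(\mathcal{V}^2, \mu)} , \]

that is, the integral operator of the anti-symmetric kernel is $T_{K^A} = A^{\mu *} T_K A^{\mu}$.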

The integral operators of the other kernels can be constructed analogously via the 
feature mappings: 


<^KPh=UK{,P^h) 

=Uif (S'^/i) 

^KPih=lJK , 

where ^ ^ is the operator obtained by stacking the operators I and P^. 

Finally, it is straightforward to check that $S^{\mu}$ and $A^{\mu}$ are projections due to their
idempotence, that is, $S^{\mu} S^{\mu} = S^{\mu}$ and $A^{\mu} A^{\mu} = A^{\mu}$. □


5.2 Proof of Theorem 4.6


Proof. We first consider the RKHS of the permutation invariant kernel $K^{PI}(v, v', \bar{v}, \bar{v}')$
given in Definition 3.2. According to the theorem concerning sums of reproducing
kernels by Aronszajn [1950], the RKHS of the permutation invariant kernel can be
written as the following space of functions:

\[
\begin{aligned}
\mathcal{H}(K^{PI}) &= \mathcal{H}(K + K^P) \\
&= \{ f_1 + f_2 : f_1 \in \mathcal{H}(K),\ f_2 \in \mathcal{H}(K^P) \} .
\end{aligned}
\]

This, together with the assumption $\mathcal{F} \subset \overline{\mathcal{H}(K)}$, implies

\[ \mathcal{F} \subset \overline{\mathcal{H}(K^{PI})} . \tag{15} \]

Let $\epsilon > 0$ and $t \in \mathcal{A}$ be an arbitrary function for which $t(v, v') = r(v, v') - r(v', v)$,
where $r \in \mathcal{F}$. According to (15), we can select a set of pairs $\{(v_i, v_i')\}_{i=1}^{m}$ and real
numbers $\{\alpha_i\}_{i=1}^{m}$ such that the function

\[ u(v, v') = \sum_{i=1}^{m} \alpha_i K^{PI}(v, v', v_i, v_i') \]

belonging to the RKHS of the kernel $K^{PI}$ fulfills

\[ \max_{(v, v') \in \mathcal{V}^2} | r(v, v') - u(v, v') | \le \frac{1}{2} \epsilon . \tag{16} \]

Let

\[ h(v, v') = u(v, v') - u(v', v) . \]

It follows from (16) that

\[ \max_{(v, v') \in \mathcal{V}^2} | t(v, v') - h(v, v') | \le \epsilon . \]

We observe that $h$ can be written in terms of the kernel $K^A$ as

\[
\begin{aligned}
h(v, v') &= \sum_{i=1}^{m} \alpha_i K^{PI}(v, v', v_i, v_i') - \sum_{i=1}^{m} \alpha_i K^{PI}(v', v, v_i, v_i') \\
&= 2 \sum_{i=1}^{m} \alpha_i K^A(v, v', v_i, v_i') ,
\end{aligned}
\]

which proves the claim for the anti-symmetric kernels. The proof for the symmetric
ones is analogous. □


5.3 Proof of Lemma 4.8


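Proof. Starting from the form given by [Cucker and Smale, 2002] and applying the
Sherman-Morrison-Woodbury formula for operators [Deng, 2011], we get

\[
\begin{aligned}
f_\lambda^K &= U_K^* h_\lambda \\
&= U_K^* (U_K U_K^* + \lambda I)^{-1} U_K f \\
&= U_K^* U_K (U_K^* U_K + \lambda I)^{-1} f \\
&= T_K (T_K + \lambda I)^{-1} f \\
&= V \Lambda (\Lambda + \lambda I)^{-1} V^* f .
\end{aligned}
\]

The bias caused by regularization is the squared error between the regression function
and $U_K^* h_\lambda$:

\[
\begin{aligned}
\epsilon_{rg}(f, T, \lambda) &= \int_{\mathcal{X}} (f - f_\lambda^K)^2\,d\mu \\
&= \langle f - f_\lambda^K, f - f_\lambda^K \rangle \\
&= \langle f - T (T + \lambda I)^{-1} f, f - T (T + \lambda I)^{-1} f \rangle \\
&= \langle f, V \left( I - 2 \Lambda (\Lambda + \lambda I)^{-1} + \Lambda^2 (\Lambda + \lambda I)^{-2} \right) V^* f \rangle \\
&= \langle f, V \left( I - \Lambda (\Lambda + \lambda I)^{-1} \right)^2 V^* f \rangle \\
&= \lambda^2 \langle f, V (\Lambda + \lambda I)^{-2} V^* f \rangle \\
&= \lambda^2 \langle f, (T + \lambda I)^{-2} f \rangle ,
\end{aligned}
\]

where the products are in $L^2(\mathcal{X}, \mu)$. □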

5.4 Proof of Theorem 4.9

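Proof. Let the regression function be symmetric, that is, it can be written as $f = Sf$.
Then,

\[
\begin{aligned}
f_\lambda^{K^{PI}} &= V \Lambda^{PI} (\Lambda^{PI} + \lambda I)^{-1} V^* f \\
&= V \Lambda^{PI} (\Lambda^{PI} + \lambda I)^{-1} V^* S f \\
&= V \Lambda^{S} (\Lambda^{S} + \lambda I)^{-1} V^* f \\
&= f_\lambda^{K^S} ,
\end{aligned}
\]

where the second last equality is due to the common eigensystem of $S$, $T_{K^S}$ and
$T_{K^{PI}}$ (see Corollary 3.6 and Remark 3.7), and hence also the bias caused by
regularization is the same for the kernels $K^{PI}$ and $K^S$. The proof is analogous for the
anti-symmetric case.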

Let $M$ be an operator for which $0 < \alpha I \le M \le \beta I$, where $\alpha$ and $\beta$ are,
respectively, the smallest and largest eigenvalues of $T + \lambda I$. We first recollect some
operator inequalities we use in the proof.


Choi’s inequality and Kadison’s inequality (see e.g. [Choi, 1974]) indicate that if
$M > 0$ and $\Psi$ is a positive and unital linear map, then

\[ \Psi(M^{-1}) \ge \Psi(M)^{-1} , \tag{17} \]

\[ \Psi(M^2) \ge \Psi(M)^2 . \tag{18} \]

Let $0 < \alpha \le M \le \beta$ and $\Psi$ be a positive unital linear map. Marshall and Olkin [1990]
proved the following operator Kantorovich type of inequality:

\[ \Psi(M^{-1}) \le \frac{(\alpha + \beta)^2}{4 \alpha \beta} \Psi(M)^{-1} . \tag{19} \]

According to the Löwner-Heinz theorem (see e.g. Carlen [2010]), if $M$ and $N$ are
operators and $M \ge N > 0$, then operator inversion reverses the positive-definite order,
that is,

\[ M^{-1} \le N^{-1} . \tag{20} \]
Further, [Fujii et al., 1997] proved the following Kantorovich type of inequality: if
$M \ge N > 0$ and $0 < \alpha \le N \le \beta$, then

\[ N^2 \le \frac{(\alpha + \beta)^2}{4 \alpha \beta} M^2 . \tag{21} \]

Armed with the above operator inequalities, we get the following combined results:

\[ \Psi(M)^{-2} \ge \Psi(M^2)^{-1} \ge \frac{4 \alpha^2 \beta^2}{(\alpha^2 + \beta^2)^2} \Psi(M^{-2}) , \]

where the first inequality is due to combining (18) with (20), and the second inequality
is due to (19) applied to $M^2$, whose spectrum lies in $[\alpha^2, \beta^2]$. Similarly,

\[ \Psi(M)^{-2} \le \frac{(\alpha + \beta)^2}{4 \alpha \beta} \Psi(M^{-1})^2 \le \frac{(\alpha + \beta)^2}{4 \alpha \beta} \Psi(M^{-2}) , \]

where the first inequality is due to combining Choi’s inequality (17) with the
inequality (21), and the second inequality is due to Kadison’s inequality (18).

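Let $\Psi(M) = SMS + AMA$, which is a unital, positive and linear mapping on the
operators on $L^2(\mathcal{V}^2, \mu)$. Then, we have $T_{K^{PI}} + \lambda I = \Psi(T_K + \lambda I)$. Combining the above
results, we get

\[
\begin{aligned}
\epsilon_{rg}(f, T_{K^{PI}}, \lambda) &= \lambda^2 \langle f, (T_{K^{PI}} + \lambda I)^{-2} f \rangle \\
&= \lambda^2 \langle f, \Psi(T_K + \lambda I)^{-2} f \rangle \\
&\le \frac{(\alpha + \beta)^2}{4 \alpha \beta} \lambda^2 \langle f, \Psi((T_K + \lambda I)^{-2}) f \rangle \\
&= \frac{(\alpha + \beta)^2}{4 \alpha \beta} \lambda^2 \langle f, (T_K + \lambda I)^{-2} f \rangle \\
&= \frac{(\alpha + \beta)^2}{4 \alpha \beta} \epsilon_{rg}(f, T_K, \lambda) ,
\end{aligned}
\]

where the second last equality is due to the assumption of the regression function being
symmetric. The lower bound can be shown analogously.

The limit of the smallest eigenvalue of $T$ is 0, and hence that of $T + \lambda I$ is $\lambda$.
Moreover, due to $\max_{(v, v') \in \mathcal{V}^2} K(v, v', v, v') = 1$, the largest eigenvalue of $T + \lambda I$
is at most $1 + \lambda$. The claimed relationship is obtained by substituting $\lambda$ and $1 + \lambda$ for $\alpha$
and $\beta$. □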
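To get a feeling for how mild the change in the bias is, one can evaluate the Kantorovich-type constants of the proof after the substitution $\alpha = \lambda$ and $\beta = 1 + \lambda$. The sketch below assumes that the upper constant has the form $(\alpha + \beta)^2 / (4 \alpha \beta)$ and that the lower constant is the reciprocal of the same expression with squared bounds; the function names are hypothetical.

```python
def kantorovich_constant(alpha, beta):
    """(alpha + beta)^2 / (4 alpha beta) for spectral bounds 0 < alpha <= beta."""
    if not 0 < alpha <= beta:
        raise ValueError("expected 0 < alpha <= beta")
    return (alpha + beta) ** 2 / (4.0 * alpha * beta)

def upper_constant(lam):
    """Constant obtained with alpha = lam and beta = 1 + lam."""
    return kantorovich_constant(lam, 1.0 + lam)

def lower_constant(lam):
    """Reciprocal of the constant for the squared bounds lam^2 and (1 + lam)^2."""
    return 1.0 / kantorovich_constant(lam ** 2, (1.0 + lam) ** 2)

# Constants for a few regularization levels.
table = {lam: (lower_constant(lam), upper_constant(lam))
         for lam in (0.1, 1.0, 10.0)}
```

Under these assumptions both constants approach one as the regularization parameter grows, while for small values of the parameter the bounds become loose.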